1,232 research outputs found

    Cross layer reliability estimation for digital systems

    Get PDF
    Forthcoming manufacturing technologies hold the promise to increase multifuctional computing systems performance and functionality thanks to a remarkable growth of the device integration density. Despite the benefits introduced by this technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor size reaching the atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and environmental stress rise dramatically. Failing to meet a reliability requirement may add excessive re-design cost to recover and may have severe consequences on the success of a product. %Worst-case design with large margins to guarantee reliable operation has been employed for long time. However, it is reaching a limit that makes it economically unsustainable due to its performance, area, and power cost. One of the open challenges for future technologies is building ``dependable'' systems on top of unreliable components, which will degrade and even fail during normal lifetime of the chip. Conventional design techniques are highly inefficient. They expend significant amount of energy to tolerate the device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency or charge stored per bit. Unfortunately, the additional cost introduced to compensate unreliability are rapidly becoming unacceptable in today's environment where power consumption is often the limiting factor for integrated circuit performance, and energy efficiency is a top concern. Attention should be payed to tailor techniques to improve the reliability of a system on the basis of its requirements, ending up with cost-effective solutions favoring the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., technology, hardware and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanism are carefully implemented at different layers starting from the technology up to the software layer to carefully optimize the system by exploiting the inner capability of each layer to mask lower level faults. For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools, able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains

    ReDO: Cross-Layer Multi-Objective Design-Exploration Framework for Efficient Soft Error Resilient Systems

    Get PDF
    Designing soft errors resilient systems is a complex engineering task, which nowadays follows a cross-layer approach. It requires a careful planning for different fault-tolerance mechanisms at different system's layers: starting from the technology up to the software domain. While these design decisions have a positive effect on the reliability of the system, they usually have a detrimental effect on its size, power consumption, performance and cost. Design space exploration for cross-layer reliability is therefore a multi-objective search problem in which reliability must be traded-off with other design dimensions. This paper proposes a cross-layer multi-objective design space exploration algorithm developed to help designers when building soft error resilient electronic systems. The algorithm exploits a system-level Bayesian reliability estimation model to analyze the effect of different cross-layer combinations of protection mechanisms on the reliability of the full system. A new heuristic based on the extremal optimization theory is used to efficiently explore the design space. An extended set of simulations shows the capability of this framework when applied both to benchmark applications and realistic systems, providing optimized systems that outperform those obtained by applying state-of-the-art cross-layer reliability techniques

    Shielding Performance Monitor Counters: A double edged weapon for safety and security

    Get PDF
    Recent years have witnessed the growth of the adoption of Cyber-Physical Systems (CPSs) in many sectors such as automotive, aerospace, civil infrastructures and healthcare. Several CPS applications include critical scenarios, where a failure of the system can lead to catastrophic consequences. Therefore, anomalies due to failure or malicious attacks must be timely detected. This paper focuses on two relevant aspects of the design of a CPS: safety and security. In particular, it studies how performance monitor counters (PMCs) available in modern microprocessors can be from the one hand a valuable tool to enhance the safety of a system and, on the other hand, a security backdoor. Starting from the example of a PMC based safety mechanism, the paper shows the implementation of a possible attack and eventually proposes a strategy to mitigate the effectiveness of the attack while preserving the safeness of the system

    Trading-off reliability and performance in FPGA-based reconfigurable heterogeneous systems

    Get PDF
    Recent years have witnessed the rapid growth of heterogeneous systems, composed of CPUs and hardware accelerators, to face up the constant increase of computational performance demand of digital systems. In this scenario, FPGAs offer the possibility to implement high performance reconfigurable accelerators, able to speed up the intrinsically parallel portions of applications. The study of reconfigurable heterogeneous systems is still maturing and, while some contributions about performance and power consumption are available, in literature there are few works addressing reliability. This paper analyzes reconfigurable heterogeneous systems in presence of permanent faults occurring in the FPGA. In this context, a reconfigurable heterogeneous architecture, including a Run Time Manager responsible for the communication of software tasks and the FPGA, the scheduling and the placement of hardware tasks, is presented. In addition, the paper introduces a reconfigurable heterogeneous system simulator for the proposed architecture. This simulator is able to evaluate during the design phase the degradation of the system performance due to permanent faults and allows to explore the design space dimensions efficiently

    Performance monitor counters: Interplay between safety and security in complex cyber-physical systems

    Get PDF
    Recent years have witnessed the growth of the adoption of cyber-physical systems (CPSs) in many sectors, such as automotive, aerospace, civil infrastructures, and healthcare. Several CPS applications include critical scenarios, where a failure of the system can lead to catastrophic consequences. Therefore, anomalies due to failures or malicious attacks must be detected timely. This paper focuses on two relevant aspects of the design of a CPS: 1) safety and 2) security. It analyzes in a specific scenario how the performance monitor counters (PMCs) available in several commercial microprocessors can be from the one hand a valuable tool to enhance the safety of a system and, on the other hand, a security backdoor. Starting from the example of a PMC-based safety mechanism, this paper shows the implementation of a possible attack and eventually proposes a strategy to mitigate the effectiveness of the attack while preserving the safety of the system

    Microarchitecture level reliability comparison of modern GPU designs: First findings

    Get PDF
    State-of-the-art GPU chips are designed to deliver extreme throughput for graphics as well as for data-parallel general purpose computing workloads (GPGPU computing). Unlike graphics computing, GPGPU computing requires highly reliable operation. The performance-oriented design of GPUs requires to jointly evaluate the vulnerability of GPU workloads to soft-errors with the performance of GPU chips. We briefly present a summary of the findings of an extensive study aiming at the evaluation of the reliability of four GPU architectures and corresponding chips, orrelating them with the performance of the workloads

    Multi-faceted microarchitecture level reliability characterization for NVIDIA and AMD GPUs

    Get PDF
    State-of-the-art GPU chips are designed to deliver extreme throughput for graphics as well as for data-parallel general purpose computing workloads (GPGPU computing). Unlike computing for graphics, GPGPU computing requires highly reliable operations. Since provisioning for high reliability may affect performance, the design of GPGPU systems requires the vulnerability of GPU workloads to soft-errors to be jointly evaluated with the performance of GPU chips. We present an extended study based on a consolidated workflow for the evaluation of the reliability in correlation with the performance of four GPU architectures and corresponding chips: AMD Southern Islands and NVIDIA G80/GT200/Fermi. We obtained reliability measurements (AVF and FIT) employing both fault injection and ACE-analysis based on microarchitecture-level simulators. Apart from the reliability-only and performance-only measurements, we propose combined metrics for performance and reliability that assist comparisons for the same application among GPU chips of different ISAs and vendors, as well as among benchmarks on the same GPU chip

    Cross-layer system reliability assessment framework for hardware faults

    Get PDF
    System reliability estimation during early design phases facilitates informed decisions for the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored in such an estimation model, the delivered reliability reports must be excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and supporting suite of tools for accurate but fast estimations of computing systems reliability. The backbone of the methodology is a component-based Bayesian model, which effectively calculates system reliability based on the masking probabilities of individual hardware and software components considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.Peer ReviewedPostprint (author's final draft

    Cross-Layer Early Reliability Evaluation for the Computing cOntinuum

    Get PDF
    Advanced multifunctional computing systems realized in forthcoming technologies hold the promise of a significant increase of the computational capability that will offer end-users ever improving services and functionalities (e.g., next generation mobile devices, cloud services, etc.). However, the same path that is leading technologies toward these remarkable achievements is also making electronic devices increasingly unreliable, posing a threat to our society that is depending on the ICT in every aspect of human activities. Reliability of electronic systems is therefore a key challenge for the whole ICT technology and must be guaranteed without penalizing or slowing down the characteristics of the final products. CLERECO EU FP7 (GA No. 611404) research project addresses early accurate reliability evaluation and efficient exploitation of reliability at different design phases, since these aspects are two of the most important and challenging tasks toward this goal
    • …
    corecore